70 research outputs found

    Factorizing Probabilistic Graphical Models Using Co-occurrence Rate

    Factorization is of fundamental importance in the area of Probabilistic Graphical Models (PGMs). In this paper, we theoretically develop a novel mathematical concept, Co-occurrence Rate (CR), for factorizing PGMs. CR has three obvious advantages: (1) CR provides a unified mathematical foundation for factorizing different types of PGMs. We show that Bayesian Network Factorization (BN-F), Conditional Random Field Factorization (CRF-F), Markov Random Field Factorization (MRF-F) and Refined Markov Random Field Factorization (RMRF-F) are all special cases of CR Factorization (CR-F); (2) CR has a simple probability definition and a clear intuitive interpretation. CR-F specifies not only the scopes of the factors but also their exact probability functions; (3) CR connects probability factorization and graph operations perfectly. The factorization process of CR-F can be visualized as applying a sequence of graph operations, including partition, merge, duplicate and condition, to a PGM graph. We further obtain an important result: by CR-F, on TCG graphs the scopes of factors can be exactly over maximal cliques without any default configuration. This improves on the results of (R)MRF-F, which need default configurations, and also indicates that (R)MRF-F, as special cases of CR-F, cannot always achieve the optimal results of CR-F.

    Concept Extraction Challenge: University of Twente at #MSM2013

    Twitter messages are a potentially rich source of continuously and instantly updated information. The shortness and informality of such messages are challenges for Natural Language Processing tasks. In this paper we present a hybrid approach for Named Entity Extraction (NEE) and Classification (NEC) for tweets. The system combines Conditional Random Fields (CRF) and Support Vector Machines (SVM) in a hybrid way to achieve better results. For named entity type classification, we used the AIDA disambiguation system [YosefHBSW11] to disambiguate the extracted named entities and hence find their types.

    Empirical co-occurrence rate networks for sequence labeling

    Sequence labeling has wide applications in many areas. For example, most named entity recognition tasks, which extract named entities or events from unstructured data, can be formalized as sequence labeling problems. Sequence labeling has been studied extensively in different communities, such as data mining, natural language processing and machine learning. Many powerful and popular models have been developed, such as hidden Markov models (HMMs) [4], conditional Markov models (CMMs) [3], and conditional random fields (CRFs) [2]. Despite their successes, they suffer from some known problems: (i) HMMs are generative models which suffer from the mismatch problem, and it is difficult to incorporate overlapping, non-independent features into an HMM explicitly; (ii) CMMs suffer from the label bias problem; (iii) CRFs overcome the problems of HMMs and CMMs, but the global normalization of CRFs can be very expensive. This prevents CRFs from being applied to big datasets (e.g. tweets).

    In this paper, we propose the empirical Co-occurrence Rate Networks (ECRNs) [5] for sequence labeling. CRNs avoid the problems of the existing models mentioned above. To make the training of CRNs as efficient as possible, we simply use the empirical distribution as the parameter estimate. This results in ECRNs, which can be trained orders of magnitude faster and still obtain accuracy competitive with the existing models. ECRN has been applied as a component of the University of Twente system [1] for the concept extraction challenge at #MSM2013, which won the best challenge submission award. ECRNs can be very useful for practitioners working with big data.
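
    The idea of using the empirical distribution as the parameter estimate can be illustrated with a minimal sketch. It assumes that the co-occurrence rate of two labels is their joint probability divided by the product of their marginals, a definition this listing does not state. The function name `empirical_cr` and the toy label sequences are hypothetical, and the sketch is not the authors' ECRN implementation.

```python
from collections import Counter


def empirical_cr(label_sequences):
    """Estimate CR(a; b) for adjacent labels from raw counts.

    Assumption (not stated in the abstract): CR(a; b) = P(a, b) / (P(a) * P(b)),
    with all probabilities taken from relative frequencies in the data.
    """
    unigrams = Counter()
    bigrams = Counter()
    for seq in label_sequences:
        unigrams.update(seq)                # counts of single labels
        bigrams.update(zip(seq, seq[1:]))   # counts of adjacent label pairs

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    cr = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        cr[(a, b)] = p_ab / (p_a * p_b)
    return cr


if __name__ == "__main__":
    # Toy BIO-style label sequences, invented for illustration only.
    data = [["O", "B", "I", "O"], ["O", "O", "B", "O"]]
    for pair, value in sorted(empirical_cr(data).items()):
        print(pair, round(value, 3))
```

    Because every quantity is a relative frequency, estimation is a single counting pass over the data with no iterative optimization, which is consistent with the abstract's claim of very fast training.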

    Named Entity Extraction and Linking Challenge: University of Twente at #Microposts2014

    Twitter is a potentially rich source of continuously and instantly updated information. The shortness and informality of tweets are challenges for Natural Language Processing (NLP) tasks. In this paper, we present a hybrid approach for Named Entity Extraction (NEE) and Linking (NEL) for tweets. Although NEE and NEL are two topics that are well studied in the literature, almost all approaches treat the two problems separately. We believe that disambiguation (linking) can help improve the extraction process. We call this potential for mutual improvement the reinforcement effect. It mimics the way humans understand natural language. Furthermore, our proposed approach handles the uncertainties involved in the two processes by considering possible alternatives.

    Separate Training for Conditional Random Fields Using Co-occurrence Rate Factorization

    The standard training method of Conditional Random Fields (CRFs) is very slow for large-scale applications. As an alternative, piecewise training divides the full graph into pieces, trains them independently, and combines the learned weights at test time. In this paper, we present separate training for undirected models based on the novel Co-occurrence Rate Factorization (CR-F). Separate training is a local training method. In contrast to MEMMs, separate training is unaffected by the label bias problem. Experiments show that separate training (i) is unaffected by the label bias problem; (ii) reduces the training time from weeks to seconds; and (iii) obtains results competitive with standard and piecewise training on linear-chain CRFs.

    Mechanism-based site-directed mutagenesis to shift the optimum pH of the phenylalanine ammonia-lyase from Rhodotorula glutinis JN-1

    Phenylalanine ammonia-lyase (RgPAL) from Rhodotorula glutinis JN-1 stereoselectively catalyzes the conversion of l-phenylalanine into trans-cinnamic acid and ammonia, and has been used in the chiral resolution of dl-phenylalanine to produce d-phenylalanine under acidic conditions. However, the optimum pH of RgPAL is 9, and RgPAL exhibits low catalytic efficiency on the acidic side. Therefore, a mutant RgPAL with a lower optimum pH is desirable. Based on the catalytic mechanism and structure analysis, we constructed a mutant, RgPAL-Q137E, by site-directed mutagenesis, and found that this mutant had an extended optimum pH range of 7–9, with activity 1.8-fold higher than that of the wild type at pH 7. As suggested by the Friedel–Crafts-type mechanism of RgPAL, the improvement of RgPAL-Q137E might be due to the negative charge of Glu137, which could stabilize the intermediate transition states through electrostatic interaction. The RgPAL-Q137E mutant was used to resolve racemic dl-phenylalanine; the conversion rate and the eeD value of d-phenylalanine using RgPAL-Q137E at pH 7 increased by 29% and 48%, reaching 93% and 86%, respectively. This work provides an effective strategy for shifting the optimum pH, which is favorable for further applications of RgPAL.

    Empirical training for conditional random fields

    In this paper (Zhu et al., 2013), we present a practically scalable training method for CRFs called Empirical Training (EP). We show that standard training with unregularized log likelihood can have many maximum likelihood estimates (MLEs). Empirical training has a unique closed-form MLE which can be calculated very quickly from the empirical distribution. The MLE of empirical training is also one MLE of standard training, so empirical training can be competitive in precision with standard training and piecewise training. We also show that empirical training is unaffected by the label bias problem even though it is a locally normalized model. Experiments on two real-world NLP datasets also show that empirical training reduces the training time from weeks to seconds and obtains results competitive with standard and piecewise training on linear-chain CRFs, especially when training data are insufficient.

    A Short Note on Aberrant Responses Bias in Item Response Theory

    Item response models often cannot calculate true individual response probabilities because of response disturbances (such as guessing and cheating). Many studies on aberrant responses under the item response theory (IRT) framework have been conducted. Some of them focus on how to reduce the effect of aberrant responses, and others focus on how to detect aberrant examinees, for example through person fit analysis. The purpose of this research was to derive a generalized formula for bias with and without aberrant responses, showing mathematically the effect of both non-aberrant and aberrant response data on the bias of capability estimation. A new evaluation criterion, named aberrant absolute bias (|ABIAS|), is proposed to detect aberrant examinees. Simulation studies and an application to a real dataset were conducted to demonstrate the efficiency and utility of |ABIAS|.

    //Rondje Zilverling: COMMIT/TimeTrails

    The TimeTrails project is about data mining in large volumes of data on events in space and time, i.e. data with coordinates and timestamps. Such data are typically collected by people, sensors and scientific observations. Data analysis often focuses on the four Ws: Who, What, Where and When. An important issue is coping with the large volumes of data, i.e. "big data". At the UT, we, i.e. the EWI/DB and ITC/GIP groups, are working on two applications:
    * Mapping public opinion on large infrastructure projects, such as the construction of a new stretch of motorway. We do this with Twitter analysis and data visualisation.
    * Finding good holiday destinations. Social media, web harvesting and the analysis of GPS traces play a role here.